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Abstract —We consider the problem of clustering noisy finite- 
length observations of stationary ergodic random processes 
according to their nonparametric generative models vrithout 
prior knowledge of the model statistics and the number of 
generative models. Two algorithms, both using the -distance 
between estimated power spectral densities (PSDs) as a measure 
of dissimilarity, are analyzed. The first algorithm, termed nearest 
neighbor process clustering (NNPC), to the best of our knowledge, 
is new and relies on partitioning the nearest neighbor graph of the 
observations via spectral clustering. The second algorithm, simply 
referred to as fc-means (KM), consists of a single A:-means itera¬ 
tion with farthest point initialization and was considered before 
in the literature, albeit with a different measure of dissimilarity 
and with asymptotic performance results only. We show that 
both NNPC and KM succeed with high probability under noise 
and even when the generative process PSDs overlap significantly, 
all provided that the observation length is sufficiently large. Our 
results quantify the tradeoff between the overlap of the generative 
process PSDs, the noise variance, and the observation length. 
Finally, we present numerical performance results for synthetic 
and real data. 

I. Introduction 

Consider a set of N noisy length-M observations of sta¬ 
tionary ergodic random processes stemming from L < N 
(typically L ^ N) different generative models. We want 
to cluster these observations according to tbeir generative 
models without prior knowledge of the model statistics and 
the number of generative models L. This problem arises in 
many domains of science and engineering where large amounts 
of data have to be divided into meaningful subsets in an 
unsupervised fashion, e.g., for efficient processing or storage. 
Examples include clustering of audio and video sequences 
Q, electrocardiography (ECG) recordings Q, and industrial 
production indices Q. 

Many existing random process clustering methods quan¬ 
tify the dissimilarity between observations using the Eu¬ 
clidean distance between either estimated model parameters 
1^, or estimated cepstral coefficients Q, Q, or normal¬ 
ized periodograms Q. Other methods rely on divergences 
(e.g., Kullback-Leibler divergence) between normalized peri¬ 
odograms 1^, Q or use the distributional distance between 
processes Q , Q. In all cases the resulting distances are then 
fed to a standard clustering algorithm such as fc-means or hier¬ 
archical clustering. Another line of work employs a Bayesian 
framework to infer the cluster assignments, e.g., according to 
a maximum a posteriori criterion ng- While many of these 
approaches have proven effective in practice, corresponding 
analytical performance results are scarce. Moreover, existing 


analytical results are mostly concerned with the asymptotic 
regime where the observation length tends to inhnity (see, e.g., 
the finite observation-length regime has attracted 
virtually no attention. 

Contributions: We consider two process clustering al¬ 
gorithms that apply to nonparametric generative models and 
measure dissimilarity between observations through the L^- 
distance between estimated power spectral densities (PSDs). 
The hrst algorithm, termed nearest neighbor process clustering 
(NNPC), is inspired by the subspace clustering algorithm 
described in im and, to the best of our knowledge, has 
not been considered in the literature before. NNPC relies on 
partitioning the q-nearest neighbor graph of the observations 
via spectral clustering (the number of nearest neighbors g is a 
parameter of NNPC). The second algorithm, which will simply 
be referred to as A:-means (KM), consists of a single A:-means 
iteration with farthest point initialization in and was hrst 
proposed in Q with a different distance measure. The original 
formulation of KM was shown in to deliver the correct 
segmentation with probability tending to 1 as the observation 
length M —oo. 

Assuming real-valued zero-mean stationary ergodic Gaus¬ 
sian processes as generative models, we characterize the per¬ 
formance of NNPC and KM analytically for hnite-length ob¬ 
servations contaminated by independent additive white Gaus¬ 
sian noise. We hnd that both algorithms succeed with high 
probability even when the (true) PSDs exhibit signihcant 
overlap, all provided that the observation length is sufficiently 
large. Our analytical results quantify the tradeoff between 
observation length, noise variance, and distance between the 
(true) PSDs of the generative models. Einally, we evaluate the 
performance of the two algorithms on synthetic and on real 
data. 

Notation: We use lowercase boldface letters to denote 
vectors, uppercase boldface letters to designate matrices, and 
the superscript ^ stands for transposition. Vi is the zth entry 
of the vector v. Eor the matrix A, j designates the entry 
in row i and column j, ||A|| 2_,.2 its maximum singular value, 
and |1A||^ := j its Erobenius norm. I stands 

for the identity matrix. The ith element of a sequence x is 
denoted by x\i]. Eor a positive integer N, [iV] designates 
the set {1, 2,..., N}. E[A] is the expectation of the random 
variable X and the notation K ~ A indicates that the random 
variable Y has the same distribution as X. We say that a 
subgraph H of a graph G is connected if every pair of nodes 


in H can be joined by a path with nodes exclusively in H. A 
connected subgraph i? of G is called a connected component 
of G if there are no edges between H and the remaining nodes 
in G |[T3). 

II. Formal problem statement and algorithms 


We consider the following clustering problem: Given the 
unlabeled data set A = Ai U • • • U of cardinality N, 
where X^ = contains the contiguous noisy length- 

M observations x] of the real-valued stationary ergodic 
random process X^^^[m],m S Z, corresponding to the fth 
generative model, find the partition Xi,..., X]^. The statistics 
of the generative models and of the noise processes, as well as 
the number of generative models, are all assumed unknown. 

Both clustering algorithms we consider are based on the 
following distance measure. Denoting the PSD of by 
sW(/), / e [0,1), we define the distance between X^^'> 
and XW by := ^“ s(^H/)|d/. 

For processes with unit power, i.e., Jq s^^\f)df = 1, 
i G [L], the factor 1/2 ensures that d{X^^\ X^^'>) takes 
on values in [0,1]. d{X^^\ X^^'>) essentially quantifies the 
dissimilarity between the support sets of and i.e., 
it is close to 1 when and are concentrated on disjoint 
frequency bands. 

We now present the NNPC and the KM algorithms. Recall 
that NNPC is inspired by the subspace clustering algorithm 
introduced in and KM is obtained by replacing the 

distance measure in Algorithm 1 in by the distance 
measure d defined above. 


The NNPC algorithm. Given a set X of N length- 
M observations, the number of generative models L (the 
estimation of L from X is discussed below), and the parameter 
q, carry out the following steps: 

Step 1 : For every Xi G X, estimate the PSD Si{f) via the 
Blackman-Tukey (BT) estimator according to 

M-l 

Si{f) ■= ^ p[m]fi[m]e“*^’^^’”, where (1) 

^ M—\m\ 

ri[m] ^ xfn + m]xi[n], |m| < M - 1, 

n—1 

and g[m\, m G Z, is an even window function (i.e., g[m] = 
g[—m\) with p[0] = 1 and nonnegative bounded discrete-time 
Fourier transform g{f), i.e., 0 < g(f) A < oo, f G [0,1). 
Identify the set Ti C [M] \ {i} of cardinality q defined through 


d(xi,Xj) < d(xi,Xp), for all j G % and all p f. %. 
Step 2: Let •Zj G be the vector with ith entry 

exy)(—2d(xi,Xj)), if j G %, and 0, if j ^ 7). 

Step 3 : Construct the adjacency matrix A according to 
■ TF, where Z = [zi ... Zjv]- 


A = Z 


Step 4: Apply normalized spectral clustering [131 to (A,L). 


The KM algorithm Q. Given a set X of N length-M 
observations and the number of generative models L, carry 
out the following steps: 

Step 1 : Initialize Ci := 1 and X^ := {} for all I G [L\. 

Step 2: For every Xi G X, estimate the PSD Si(f) via the BT 


estimator Q- 

Step 3: for p = 2 to L do: 

Cp := argmaxig[Af] minfg[p_i] d(x^,XcF- 

Step 4: for i = 1 to A do: 

£* G- argmin^g[i] d(xi,XcF 

X(* G- Xg* U {xi} 

Both algorithms are based on comparing distances between 
observations, and are therefore meaningful only if the obser¬ 
vations are of comparable average power. 

We henceforth denote the nearest neighbor graph with 
adjacency matrix A obtained in Step 3 of NNPC by G. The 
parameter q in the NNPC algorithm determines the number 
of edges in G. Choosing q too small may result in the 
observations stemming from a given generative model forming 
multiple connected components in G and hence not being 
assigned to the same cluster. This problem can be countered 
by increasing q, which, however, increases the chances of 
observations coming from different generative models being 
connected in G and hence being misclustered. The issue of 
how to choose q in practice is further discussed in the next 
section. 

In the NNPC algorithm L may be estimated using the 
eigengap heuristic | [T3| , which relies on the fact that the 
number of zero eigenvalues of the normalized Laplacian of 
G corresponds to the number of connected components of G. 

We finally note that the BT PSD estimates 0 can be 
computed efficiently via the FFT 

III. Pereormance results 

For our analytical performance results, we assume that the 
x) for given i, are obtained as contiguous length-M obser¬ 
vations of X^^'1[m\ := X^^'1[m]-\-W^^'^[rri\,m G Z, where 
is zero-mean stationary Gaussian with PSD s^^\f), and 
is a zero-mean white Gaussian noise process with variance 
and independent of X^^\ Note that the noise statistics 
are identical across the the generative models. The autocor¬ 
relation functions (ACFs) := /q^ are 

assumed absolutely summable, i.e., J2m=-oo < oo, 

I G [L], which implies ergodicity of the corresponding X^^\ 
Further, we assume that the PSDs are normalized accord¬ 
ing to Jq = 1, ( G [L], and we define B := 

max^g[i] supjg[o i) O™ performance results depend 

on the maximum ACF moment p,rnax := max^gj^j where 
:= Em=-oo \h[m]\\r<^‘^'>[m]\ with 

9 im)il-\m\/M), for |m| < M 
^ ' ' 1 1, otherwise. ^ ^ 

(i) 

For each i, the x) may either stem from independent realiza¬ 
tions of X^^'^ or correspond to different (possibly overlapping) 
length-M segments of a given realization of X^^\ In the latter 
case the x) will not be statistically independent in general. 

Our main result for NNPC ensures the no false connections 
(NFC) property defined as follows. 

Definition 1 (No False Connections Property). G has no false 
connections if, for all i G [L], nodes corresponding to Xg are 
connected to other nodes corresponding to Xg only. 



Although the NFC property alone does not guarantee correct 
clustering, it was found to be a sensible performance measure 
for subspace clustering algorithms (see, e.g., pd] , fid)). To 
ensure correct clustering one would additionally have to ensure 
that the subgraphs of G corresponding to the Xi are connected. 
Proving connectivity for NNPC appears to be difficult in the 
noisy finite observation-length regime. 

Theorem 1. Let q < min^g[i](nf —1) and let X be generated 
according to the data model described above. Then, the 
clustering condition 

min 

2, /2l0gM 

> 8A(B + (7 )G—— -h 2/i„iax (3) 

guarantees that G has NFC with probability at least 1 — 
2N/M‘^. 

Our main result for KM comes in terms of a stronger 
performance guarantee, namely it ensures correct clustering. 
This is thanks to the fact that KM does not have a spectral 
clustering step and is hence much easier to analyze. On the 
other hand NNPC typically outperforms KM in practice. 

Theorem 2. Let X be generated according to the data model 
described above. Then, under the clustering condition ([^, the 
partition Xi,..., Xj, of X inferred by KM corresponds to the 
true partition Xi,..., Xl with probability at least \ — 2N/M‘^. 

The proofs of Theorems [T] and are given in the Appendix. 
Theorems [^andj^essentially state that NNPC and KM succeed 
even when the PSDs of the overlap significantly 

and the observations are contaminated by strong noise, all this 
provided that the observation length M is sufficiently large and 
the window g smoothes the BT PSD estimates appropriately 
so that /Tniax IS Small. The clustering condition ([^ quantifies 
the tradeoff between the (maximum) amount of overlap of the 
(through the observation 

length M, and the noise variance a^. The first term on the 
RHS of (|^ vanishes as M ^ oo. For with small essential 
support relative to M, i.e., [m] « 0 for m > Mq with 

Mo <C M, choosing g such that g[m] « 1 for m < Mg yields 
Mmax 1 (since h[m] « 1 — g[m] ~ 0 for m < Mf). Hence, 
the clustering condition (|^ can indeed be satisfied if the 
decay rapidly enough. Note that the choice of g affects the 
constant A. In particular, increasing the width of g results in 
larger A. 

To ensure that the probability of success of NNPC and KM 
is high, we need to take M ^ V^, i.e., the observation length 
should be large relative to the square root of the number of 
observations. 

We next address the choice of q for NNPC. The condition 
q < min^g[j^](n^ — 1) in Theorem depends on the unknown 
quantities ng and admits a large range of values for q only if 
the clusters have balanced sizes, i.e., if ne tuN/L, I € [L]. 
In practice, however, the performance of NNPC is observed 
to be quite robust w.r.t. the choice of q even when the cluster 
sizes are imbalanced. 

We would like to point out that virtually all existing 
analytical performance results for random process clustering 
apply to the asymptotic regime where M ^ oo, for N 


fixed. The results that come closest to ours in spirit can be 
found in where it is shown, in the asymptotic setting, 

that observations coming from different generative models 
can consistently (in the statistical sense) be discriminated via 
a PSD-based distance measure, provided that the PSDs of 
the generative models differ on a set of positive Lebesgue 
measure. 

Finally, we note that the random process clustering problem 
considered here can be cast as a subspace clustering problem 
simply by interpreting the observations xi as vectors in 
Numerical results, not reported here, demonstrate, however, 
that NNPC clearly outperforms its subspace clustering cousin, 
the thresholding based subspace clustering (TSC) algorithm 
0- This is thanks to NNPC exploiting the stationarity of 
the generative models, a property that is usually inexistent in 
subspace clustering and is hence not taken into account by 
TSC. 

IV. Numerical result^]] 

Synthetic data: We evaluate the performance of NNPC 
and KM in terms of the clustering error (CE), i.e., 
the fraction of misclustered observations. We consider 
L = 3 generative ARMA models with PSDs Sa,b(/) = 

I where a and b are 

coefficient vectors of length n + 1 and m + 1, respectively. 
We choose ai = l,bi = [3/4 1 —7/4 1/2],a2 = l,b 2 = 
[1/2 5/4 - 3/2 3/4], and ag = [1 - 1/5 2/5 1/10],bg = 1, 
and then normalize the coefficient vectors to ensure that the 
processes have unit power. It can be seen in Figure E] (left) 
that the process PSDs overlap significantly. For the BT PSD 
estimator, we use a Gaussian window g with standard deviation 
50 for both NNPC and KM. We sample ni = 25 independent 
observations from each generative model, and for each value 
of M the CE is averaged over 200 such problem instances. 
The number of generative models L = 3 is assumed known. 
The performance of NNPC is rather insensitive to the choice 
of g for 10 < (j < 20 (corresponding results are not shown 
here) and we set q = 10. Eigure [T] (right) shows the resulting 
average CE as a function of M. NNPC is seen to consistently 
outperform KM. Eor larger M the performance difference 
between NNPC and KM becomes less pronounced. Note that 
for the choice of model parameters and window function in 
this experiment the clustering condition Q is not satisfied. 




Fig. 1. PSDs of the generative models (left) and average CE as a function 
of M (right). 

Real data: We consider the problem of clustering se¬ 
quences of human locomotion according to the activities per¬ 
formed. Specifically, we repeat the experiment from 0^ 0 

^Matlab code available at http://www.nari.ee.ethz.ch/commth/research/ 























which uses the Carnegie Mellon Motion Capture databas^ 
containing motion sequences of 149 subjects performing var¬ 
ious activities. The motion sequences describe the positions 
of markers on different body parts over time, recorded using 
optical tracking. The experiment is based on subjects #16 and 
#35 for which the database contains 49 and 33 sequences, 
respectively, labeled either as “walking” or “running”. We 
assume L = 2 to be known and we cluster the sequences 
describing the motion of the marker placed on the right foot 
of the subject. Differences in sequence lengths are accounted 
for by zero-padding to the maximum sequence length. Fur¬ 
thermore, we normalize the BT PSD estimates to unit power. 
Table [I] lists the CE as well as the entropy of the clustering 
confusion matrix S (defined in p3] Sec. 6]) for q = 6 (i.e., the 
value in [min^gj^j (n^ — 1)] which yields the lowest CE and 
S concurrently). Comparing S to the corresponding values 
reported in Q, p7j , we find that for subject #35 NNPC 
and KM perform better than the algorithm proposed in 0 
and match the performance of the algorithm considered in 
0, while for subject #16 NNPC outperforms both of these 
algorithms as well as KM. 



NNPC 

KM 

subject 

CE 

s 

CE 

S 

#16 

0.02 

0.09 

0.24 

0.55 

#35 

0 

0 

0 

0 


TABLE I 

CE AND S FOR CLUSTERING HUMAN MOTION SEQUENCES 

Appendix 

The central element in the proofs of Theorems and is 
the following result, proven at the end of the Appendix. 

Theorem 3. Consider a data set X generated according to 
the data model described in Section m Then, the clustering 
condition 0 implies 

mm mm dix, ,x) ) > max max dix) ,x) ) 

k,ie[L]-. zgK], ^ ielL] i.iGK]: ^ 

k^t jG[nfc] 

(4) 

with probability at least 1 — 2iV/M^. 

Theorem essentially says that under the clustering con¬ 
dition 0 observations stemming from the same generative 
model are closer (in terms of the distance measure d) than 
observations stemming from different generative models. We 
now show how Theorems [T] and follow from Theorem 0 
Proof of Theorem The NEC property for q < 
min^g[^](n^ — 1) is a direct consequence of Q, which by 
Theorem is implied by the clustering condition ([^. 

Proof of Theorem ^ The proof is effected by showing 
that in Step 3 KM selects an observation with a different 
generative model in every iteration, i.e., {icflfei contains 
exactly one observation of each generative model, provided 
that the clustering condition Q and hence, by Theorem]^ Q 
holds. The argument is then concluded by noting that, again 
by Q and hence by (|^, the partition Xi,..., Xl obtained in 
Step 4 corresponds to the true partition Xi,..., X]^. 

^available at http://mocap.cs.cmu.edu 


Suppose that after the pth iteration in Step 3 of KM the 
observations Xc^jt G [p], all come from different generative 
models, and assume w.l.o.g. that the generative model under¬ 
lying Xcf has index £, for £ G [p]. Eor iteration p-f 1, it follows 
from Q that 

max min d(xi,Xc,) 
iG[/V]^6[p] 

= max< max mmd(x^'^\xi^}), max min > 

1 fcebl. ee[p] ' feeWMp]. ee[p] M 

^iGbfc] iebfc] 


— max < max d(x] 

[ kelp], 


W y.{k) 


, max min d{x] 
feG[L]\[p], ie[p] 

ie[nk] 


) > min min d{ 
k,£G[L]: 

j£[nk] 


Mi 


(5) 


< max max dlx . ,x . 

i,je[ne]: ‘ ^ 

ifj 

= max mmd(x^^\x^,^}). 

feG[L]\b], ^Gb] ^ 

ie[nk] 

Note that in the maximization in Q A: runs over [L] \ [p] which 
means that icp+i is guaranteed to have a generative model that 
is different from those underlying x^, ■ ■ ■, Xcj, ■ Iterating the 
preceding argument for p = 2,... ,L, we find that, indeed, 
contains exactly one observation of each generative 

model. 

Proof of Theorem^ Lets*^^i(/) := s^^\f)+w^^\f) with 
w(^)(/) = a'^^ f G [0,1), for all £ G [L], be the PSD of 
and the corresponding ACE. Define ef\f) ;= sf\f) — 
SW(/) and set e := maxfg[i]_ig[„,] supyg[o_i) |ef^(/)|. We 
have for all k,£ G [L],i G [nr], j G [uk], that 


d{xf\xf^) = i 

4' 




d/ 


df 


sf^(/) df 






df 

( 6 ) 

(7) 


Reversing the triangle inequality leading to 0 it follows 
similarly that 

dixf\xf^) > - e (8) 

for all k,£ G [L],i G [nr], j £ [n^]. Replacing d{xf\x^^^^) on 

the RHS of 0 by the upper bound in 0 and using the lower 
bound in 0 on the LHS of 0, we find that 0 is implied by 

min > 2£. (9) 

k,ielL]-. k^t 


We continue by upper-bounding e. To this end, define 

Qm e {0,1}^^“^ as {Qm)u,v = I, if V - U = TO, 
and (Qm)iij; = 0, otherwise, A4 := {—M + 1,—M -f 
2,...,M-’l}, and G(/) := EmGAT g[m] cos (27r/TO)Q^, 
i.e., Gu,v{f) = Gy^uif) = g[v - u]cos(27r/(u - u)). Now, 



















h X G the random vector containing the elements 
\ it holds for m £ M. that 

'(^)r 1 -{tit 1 IE[x Q^xj 


with X G 

xf'^ 


of 


rf'^ [m] 

With 


M 


we have 


E[x'^Q„x| 
M- 


\m\ 


sup 

ef\f) 

= sup 

/e[o,i) 


/e[0,l) 




- sup 

/e[o,i) 

= sup 
/e[o,i) 


mGZ 

(x'^QmX- E[x^Qmx]) 

^-V- 


m 


■V 

W (.’'"'(EmgAI cos(27i-/m)Q„)x 

9[™] ™s(2ir/m)Q„)x]) 

^ ^E[x^Q™x] 

^GAI -V-' mGZ 

=g[m](l-J^)f('!) [m] 

1 (10) 

— (x^G(/)x-E[x'^G(/)x]) 

'-V-' 

y]] (11) 


= sup 

/G[0.1) 


< sup 

/G[0.1) 


«r(/) 


E 


m^Z 


|r(^( [ml 


|/i[0] I 

=0 


= sup +Mmax, (12) 

/G[0.1) 

where we used the fact that g[m]{x^Qrn^ — E[x^Qmx]) 
is a real-valued even sequence, employed the dehnition of 
h[m] in Q in the step leading from ( [T0| to ( [TT] i, and invoked 
p[0] = 1 (i.e., /i[0] = 0) as well as = r^^'>[m\ + 

a^6[m] to obtain the last inequality. It follows from ( [T^ that 
e < max£g[i]_igfe^] sup^g[o_i) \af\f)\ -fAimax and hence ^ 
implies 0 via Q on the event 

af\f) + 


B.= < max sup 
U^[L],i^[ne] /g[o,l) 


M 


It remains to lower-bound which will be accomplished 

by upper-bounding the tail probability of the random variables 
af\f). For fixed /, the distribution of af\f) obeys 




1 


if) ~ ^ (y^C^G(/)Cy - E [y^C^G(/)Cy]) , 


where the entries of y are independent standard normal ran¬ 
dom variables and C = (R-|-ct^I)^/^ G with = 

— It] the (Toeplitz) covariance matrix corresponding to 
M consecutive elements of Setting B := C^G(/)C, 
we can establish a bound on the tail probability of a] by 
invoking a well-known concentration inequality for quadratic 
forms in Gaussian random vectors Lem. 1], namely 

|y^By-E[y^By]| 


> B -l-B’ 




2|!B||2_J,2<) 


< 2e 


-s 


(13) 


Next, we 

2yM||B||2^2 

llR + tr^IIL 


note that 
and IIB 

|G(/)||2^2 


B 


-B^l! 

< 


F — 


F < 2|!B|| 

2^2 ^ ,|C|1^^2l|G(/)||2^2 = 

< AiB + a'^), where we used that 


both R and G(/) are symmetric Toeplitz matrices and hence, 
Lem. 4. 1], ||R|| 2^2 < sup^g[o,i) sW(/) < B and 
2^2 < supy,g[o^i) gif + f) = sup^,g[o,i) gif') < A 


by 117, 

worn 


(here, the frequency shift by / due to the cos-factors in the 
elements of G(/) does not affect the supremum because 
gif) is 1-periodic). Now, setting 6 = 21og(M) and using 
5/M < sJsJM < 1, for M > 1, ( [T3| l yields 

2 


sup 

cf\f) 

_/e[o,i) 



21ogM 

M 


< 


M2' 


Finally, it follows from a union bound argument that 

/21ogM 


P[.F]>1-^P 

sup 

df\f) 

^G[L], 

/e[o,i) 



’if) >4A{B + a^) 


M 


> 1 - 


2N 

M2’ 
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